In this R Notebooks, I will explore the distribution of Zipcodes accross DMAs. When we map the DMAs based on the zipcodes, we get that a given DMA is widespread accross the US as shown in the map below. This is noise in the data. But what percentage is noise?
dma_zip
For a given zipcode, the DMA that it most frequently is associated with is the correct zipcode-DMA association. All the rest are outliers. Therefore, we will calculate the percentage of occurance of a given zipcode accross different DMAs to detect the outliers.
library(data.table)
library(DT)
library(plotly)
library(tidyverse)
df <- fread("Documents/gannet/clustering/EDA/traffic_by_host_status_dma.csv")
df = df %>% select(geo_dma, geo_zip, count)
The table below shows sample of records of geo_zip and geo_dma with their count. These are the most frequent ones.
dma_zip = df %>% group_by(geo_dma, geo_zip) %>% summarise(count_dma_zip = sum(count)) %>% arrange(desc(count_dma_zip))
datatable(dma_zip %>% head())
The table below shows the total instances by zipcode. It will help us to analyze what percentage of records of a given zipcode is found in a given DMA.
zip = df %>% group_by(geo_zip) %>% summarise(count_zip = sum(count)) %>% arrange(desc(count_zip))
datatable(zip %>% head())
Now, let’s calculate percentage of a given zipcode records across a DMA.
joined = dma_zip %>% inner_join(zip, by = 'geo_zip')
joined = joined %>% mutate(percentage = count_dma_zip/count_zip * 100) #%>% arrange(desc(percentage))
data.table(joined)
Let’s see the distibution of the percentages.
g1 = joined %>% ggplot(aes(x = percentage)) + geom_histogram()
ggplotly(g1)
We see two peaks near 0 and 100. Let’s use log scale to further understand the distribution.
g2 = joined %>% ggplot(aes(x = log10(percentage))) + geom_histogram()
ggplotly(g2)
joined %>% filter(percentage > 95) %>% distinct(geo_zip) %>% nrow()
[1] 32793
df %>% select(geo_zip) %>% distinct() %>% nrow()
[1] 33089
We see that 296 zipcodes (33089-32793), which is about 0.9%, have less than 95% maximum percentage of occrance in a given DMA.
z1 = df %>% select(geo_zip) %>% dplyr::distinct()
z2 = joined %>% filter(percentage > 95) %>% dplyr::distinct(geo_zip)
less_95 = z1 %>% anti_join(z2,by = "geo_zip")
g2 = joined %>% semi_join(less_95) %>% dplyr::group_by(geo_zip) %>% summarise(percentage = max(percentage)) %>%
ggplot(aes(x = percentage)) + geom_histogram()
ggplotly(g2)
z1 = joined %>% dplyr::group_by(geo_zip) %>% summarise(percentage = max(percentage))
outliers = joined %>% anti_join(z1, by = c("geo_zip", "percentage")) %>% select(geo_dma, geo_zip, count_dma_zip, percentage)
cat('There are ', sum(outliers$count_dma_zip), 'outlier zipcodes out of a total of', sum(df$count), 'records')
There are 1123285 outlier zipcodes out of a total of 451653467 records
cat('This is', sum(outliers$count_dma_zip)/sum(df$count)*100, 'percent')
This is 0.2487051 percent
So, there are less than 0.25% of zipcodes outside of their DMA (outliers) assuming that those with maximum percentages for each geo-zip and geo-dma combination are correct and after filtering them out we get the map below.